We show that the way in which the Shannon entropy of sequences produced by an information
source converges to the source's entropy rate can be used to monitor how an intelligent agent 
builds and effectively uses a predictive model of its environment. We introduce natural measures 
of the environment's apparent memory and the amounts of information that must be (i) extracted 
from observations for an agent to synchronize to the environment and (ii) stored by an agent for 
optimal prediction. If structural properties are ignored, the missed regularities are converted to 
apparent randomness. Conversely, using representations that assume too much memory results in 
false predictability. 



PACS: 02.50.Ey 05.45.-a 05.45.Tp 89.75.Kd 

I. UNTANGLING ENVIRONMENTAL 
STRUCTURE FROM RANDOMNESS 

We examine ways to untangle the different mechanisms 
responsible for apparent randomness observed by an in- 
telligent agent. For our purposes here "intelligent agent" 
simply refers to an observer that (i) actively builds in- 
ternal models of its environment using available sensory 
stimuli and (ii) takes action based on these models. In 
addressing the issue of distinguishing different sources of 
environmental noise and structure, we analyze those as- 
pects of apparent randomness over which an intelligent 
agent may exercise control. These choices include the 
amount of data to collect, as well as the choice of statistic 
or modeling representation used to quantify the degree 
of randomness. 

FIG. 1. The measurement channel: The internal states
{A, B, C} of the system are reflected, only indirectly, in the
observed measurement of 1s and 0s. An observer works with
this impoverished data to build a model of the underlying
system. After Ref. [1].

One of the central questions addressed in the following
is, How does an agent, apprised of the environment's
possible states and behaviors, come to know the state 
of its environment? We will show that this is related to 
another question, How does an agent come to accurately
estimate how random an environment is? In particular, 
we investigate how finite-data approximations converge 
to an asymptotic measure of randomness by introducing 
several quantities that capture the nature of this conver- 
gence. We demonstrate that regularities that are unseen 
are "converted" to apparent randomness. A more thor- 
ough discussion of these results, including proofs of the 
following propositions and theorems, is found in Ref. [ ].



II. MEASUREMENT CHANNEL 

We adapt Shannon's conception of a communication 
channel as follows: We assume that there is an environ- 
ment (source or process) that produces a sensory data 
stream (message): a string of symbols drawn from a finite
alphabet $\mathcal{A}$. The task for the agent (receiver or
observer) is to estimate the probability distribution of 
sequences and, thereby, estimate how random the en- 
vironment is. Further, we assume that the agent does 
not know the environment's structure; the range of its 
states and their transition structure — the environment's 
internal dynamics — are hidden from the agent. (We will, 
however, relax this assumption in Sec. V below.) Since
the agent does not have direct access to the environ- 
ment's internal, hidden states, we picture instead that 
the agent simply collects blocks of measurements from 
the data stream and stores the block probabilities in 
a histogram (the internal model). In this scheme, the 
agent can estimate, to arbitrary accuracy, the probabil- 
ity of measurement sequences by observing for arbitrary 
lengths of time. 

This measurement channel scenario is illustrated in
Fig. 1. In this case, the environment is a three-state
deterministic finite automaton. However, the agent does
not see the internal states {A, B, C}. Instead, it has access
only to the measurement symbols {0, 1} generated
on state-to-state transitions by the hidden automaton.
In this sense, the measurement channel acts like a communication
channel; the channel maps from an internal-state
sequence ...BCBAACBC... to a measurement
sequence ...0111010.... The environment depicted in
Fig. 1 belongs to the class of stochastic processes known
as hidden Markov models. The transitions from internal
state to internal state are Markovian, in that the probability
of a given transition depends only upon which
state the process is currently in. However, these internal
states are not seen by the agent, hence the name hidden
Markov model.
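The measurement channel can be made concrete with a short simulation. The sketch below is illustrative only: the three-state machine, its transition probabilities, and the names `TRANSITIONS` and `generate` are assumptions chosen for this example, not the automaton of Fig. 1. It shows that the data stream available to the agent contains only the emitted symbols, while the state sequence stays hidden.

```python
import random

# Hypothetical three-state hidden Markov model (NOT the machine of Fig. 1).
# TRANSITIONS[state] lists (probability, emitted symbol, next state).
TRANSITIONS = {
    "A": [(0.5, 0, "A"), (0.5, 1, "B")],
    "B": [(1.0, 1, "C")],
    "C": [(0.5, 0, "A"), (0.5, 1, "B")],
}


def generate(n_symbols, start="A", seed=0):
    """Return (hidden_states, measurements) for n_symbols transitions."""
    rng = random.Random(seed)
    state = start
    states = [state]
    symbols = []
    for _ in range(n_symbols):
        r = rng.random()
        cumulative = 0.0
        # Pick a transition by walking the cumulative probabilities; if
        # rounding prevents a break, the last listed transition is used.
        for prob, symbol, nxt in TRANSITIONS[state]:
            cumulative += prob
            if r < cumulative:
                break
        symbols.append(symbol)
        state = nxt
        states.append(state)
    return states, symbols


states, symbols = generate(20)
print("hidden states :", "".join(states))
print("measurements  :", "".join(map(str, symbols)))
```

Only the second printed line corresponds to what the agent receives; the first is shown for comparison.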



III. ENTROPIES: MEASURING RANDOMNESS

Let $\Pr(s^L)$ denote the probability distribution over
blocks $s^L = s_0, s_1, \ldots, s_{L-1}$ of $L$ consecutive environment
observations, $s_i \in \mathcal{A}$. Then the total Shannon entropy
of these $L$ consecutive measurements is defined to
be:

\[ H(L) \equiv - \sum_{s^L} \Pr(s^L) \log_2 \Pr(s^L) , \tag{1} \]

where $L > 0$. The sum is understood to run over all possible
blocks of $L$ consecutive symbols. The units of $H(L)$
are bits. The entropy $H(L)$ measures the uncertainty associated
with sequences of length $L$. (For a more detailed
discussion of the Shannon entropy and related information
theoretic quantities, see, e.g., Ref. [ ].) Below, we
will focus on the behavior of the Shannon entropy curve
$H(L)$. We shall see that examining how $H(L)$ grows with
$L$ leads to several quantities that capture aspects of the
environment's randomness and structure.

The source entropy rate is the rate of increase with
respect to $L$ of the total Shannon entropy in the large-$L$
limit:

\[ h_\mu \equiv \lim_{L \to \infty} \frac{H(L)}{L} , \tag{2} \]

where $\mu$ denotes the measure over infinite sequences that
induces the $L$-block joint distribution $\Pr(s^L)$; the units
are bits/symbol. Alternatively, one can define a finite-$L$
approximation to $h_\mu$,

\[ h_\mu(L) = H(L) - H(L-1) , \tag{3} \]

and show [ ] that $h_\mu = \lim_{L \to \infty} h_\mu(L)$. The entropy rate
quantifies the irreducible randomness in observation
sequences produced by the environment: the randomness
that persists even after statistics over longer and
longer blocks of observations are accounted for by the
agent.
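The histogram picture of the agent's internal model translates directly into an estimator for the block entropy of Eq. (1) and the finite-L entropy-rate approximation of Eq. (3). The following is a minimal sketch, assuming the agent counts overlapping blocks in a single observed sequence and adopts the convention H(0) = 0; the function names and the period-2 test signal are illustrative choices, not part of the original analysis.

```python
from collections import Counter
from math import log2


def block_entropy(sequence, L):
    """Estimate H(L) in bits from the overlapping length-L blocks of `sequence`."""
    if L == 0:
        return 0.0  # assumed convention: H(0) = 0
    n_blocks = len(sequence) - L + 1
    counts = Counter(tuple(sequence[i:i + L]) for i in range(n_blocks))
    return -sum((c / n_blocks) * log2(c / n_blocks) for c in counts.values())


def entropy_rate_estimate(sequence, L):
    """Finite-L estimate h_mu(L) = H(L) - H(L-1), for L >= 1."""
    return block_entropy(sequence, L) - block_entropy(sequence, L - 1)


# Assumed test signal: a period-2 environment ...010101...
data = [0, 1] * 5000
for L in range(1, 5):
    print(L, round(block_entropy(data, L), 4), round(entropy_rate_estimate(data, L), 4))
```

For this periodic signal H(L) levels off near 1 bit, so the estimates of the entropy rate fall toward zero once L exceeds 1.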




FIG. 2. Total Shannon entropy growth for a finitary information
source: a schematic plot of $H(L)$ versus $L$. $H(L)$
increases monotonically and asymptotes to the line $E + h_\mu L$,
where $E$ is the excess entropy and $h_\mu$ is the source entropy
rate. The shaded area is the transient information $T$. For
more discussion, see text.



IV. EXCESS ENTROPY: MEASURING MEMORY 

Having looked at length-$L$ sequences, an agent can estimate
the true randomness $h_\mu$ by calculating $h_\mu(L)$, defined
in Eq. (3). With enough sensory data it can get
good approximations to $h_\mu$ by using long sequences. But
what if there is insufficient data to allow this? To answer
this we must determine how the estimates $h_\mu(L)$
converge to $h_\mu$. One measure of convergence is provided
by the excess entropy $E$:

\[ E \equiv \sum_{L=1}^{\infty} \left[ h_\mu(L) - h_\mu \right] . \tag{4} \]

The units of $E$ are bits. The excess entropy is not a new
quantity; it was first introduced almost two decades ago
[ ]. For recent reviews see [ ].

$E$ measures the convergence of $h_\mu(L)$ and plays a role
in determining how an agent comes to know how random
its environment is. But what exactly does $E$ measure?
The length-$L$ approximation $h_\mu(L)$ overestimates the entropy
rate at finite $L$ by an amount $h_\mu(L) - h_\mu$. This
difference measures how much more random single measurements
appear using the finite $L$-block statistics than
the statistics of infinite sequences. In other words, this
excess randomness tells us how much additional information
must be gained from the environment in order
to reveal the actual per-symbol uncertainty $h_\mu$. Thus,
we can think of the difference $h_\mu(L) - h_\mu$ as the redundancy
(per symbol) in length-$L$ sequences: that portion
of information-carrying capacity in the $L$-blocks which is
not actually random, but is due instead to correlations.
The excess entropy $E$, then, is the total amount of this
redundancy and, as such, a measure of one type of memory
intrinsic to an environment.

The next proposition establishes a geometric interpretation
of $E$ and an asymptotic form for $H(L)$.

Proposition 1 The excess entropy is the subextensive
part of $H(L)$; that is,

\[ E = \lim_{L \to \infty} \left[ H(L) - h_\mu L \right] . \tag{5} \]

This proposition implies the following asymptotic form
for $H(L)$:

\[ H(L) \sim E + h_\mu L , \quad \text{as } L \to \infty . \tag{6} \]

Thus, we see that $E$ is the $L = 0$ intercept of the linear
function Eq. (6) to which $H(L)$ asymptotes. This observation,
also made in Refs. [ ], is shown graphically
in Fig. 2.
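The step from the sum in Eq. (4) to the limit in Eq. (5) can be sketched as a telescoping sum. The partial-sum cutoff $M$ and the convention $H(0) = 0$ are assumptions introduced for this sketch; it is not a substitute for the proof cited above.

```latex
\begin{align*}
\sum_{L=1}^{M} \left[ h_\mu(L) - h_\mu \right]
  &= \sum_{L=1}^{M} \left[ H(L) - H(L-1) \right] - M h_\mu \\
  &= H(M) - H(0) - M h_\mu \\
  &= H(M) - h_\mu M , \qquad \text{taking } H(0) = 0 .
\end{align*}
```

Letting $M \to \infty$ recovers Eq. (5) whenever the sum in Eq. (4) converges.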

Another way to understand excess entropy is through
its expression as a type of mutual information.

Proposition 2 The excess entropy is the mutual information
between the past and the future:

\[ E = \lim_{L \to \infty} I[ S_0 S_1 \cdots S_{2L-1} ; S_{2L} S_{2L+1} \cdots S_{4L-1} ] , \tag{7} \]

when the limit exists.

Eq. (7) says that $E$ measures the amount of historical
information stored in the present that is communicated
to the future. For a discussion of some of the subtleties
associated with this interpretation, however, see Ref. [ ].
Prop. 1 also shows that $E$ can be interpreted as the cost
of amnesia: If an agent suddenly loses track of its environment,
so that it cannot be predicted at an error level
determined by the entropy rate $h_\mu$, then the environment
appears more random by a total of $E$ bits.

V. TRANSIENT INFORMATION: A MEASURE 
OF SYNCHRONIZATION 

We now introduce a quantity that answers the question,
How does $H(L)$ converge to its asymptote $E + h_\mu L$?
That is, when has an agent made a sufficient number of
observations that it can determine the complexity of its
environment? The answer to these questions is provided
by the transient information $T$:

\[ T \equiv \sum_{L=0}^{\infty} \left[ E + h_\mu L - H(L) \right] . \tag{8} \]

Note that the units of $T$ are bits $\times$ symbols. The transient
information is a new quantity, recently introduced
by us in Ref. [ ].

Thus, for finite-memory ($E$ and $T$ finite) processes
$H(L)$ scales as $E + h_\mu L$ for large $L$. When this scaling
form is attained, we say that the agent is synchronized to
the environment. In other words, when

\[ T(L) \equiv E + h_\mu L - H(L) = 0 , \tag{9} \]

we say the agent is synchronized at length-$L$ sequences.
As we will see below, agent-environment synchronization
corresponds to the agent being in a condition of knowledge
such that it can predict the environmental observations
at an error rate commensurate with the environment's
entropy rate $h_\mu$. We refer to $T$ as transient
since during synchronization the agent's prediction probabilities
change, stabilizing only after it has collected a
sufficient number of observations.
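A small numerical example may help fix these definitions. The sketch below assumes a specific environment, the two-symbol "golden mean" Markov chain in which a 1 is always followed by a 0, and assumes that the sum defining the transient information starts at L = 0 with H(0) = 0. The names `P`, `PI`, `L_MAX`, and `sync_L` are illustrative. It computes exact block entropies by enumeration, then the offset from the asymptote at each L and the length at which that offset first vanishes.

```python
from itertools import product
from math import log2

# Assumed example environment: the "golden mean" Markov chain over {0, 1},
# in which a 1 is always followed by a 0. P[a][b] = Pr(next = b | current = a).
P = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}
PI = {0: 2 / 3, 1: 1 / 3}  # stationary distribution of P


def block_entropy(L):
    """Exact H(L) in bits, enumerating all 2**L words; H(0) = 0 by convention."""
    total = 0.0
    for word in product((0, 1), repeat=L):
        if not word:
            continue
        p = PI[word[0]]
        for a, b in zip(word, word[1:]):
            p *= P[a][b]
        if p > 0:
            total -= p * log2(p)
    return total


# Entropy rate of a first-order Markov chain, in bits per symbol.
h_mu = -sum(PI[a] * P[a][b] * log2(P[a][b]) for a in P for b in P[a] if P[a][b] > 0)

L_MAX = 12
H = [block_entropy(L) for L in range(L_MAX + 1)]
E = H[L_MAX] - h_mu * L_MAX                               # excess entropy at finite L_MAX
T_of_L = [E + h_mu * L - H[L] for L in range(L_MAX + 1)]  # offset from the asymptote
T = sum(T_of_L)                                           # transient information, truncated
sync_L = next(L for L, t in enumerate(T_of_L) if abs(t) < 1e-9)

print(f"h_mu = {h_mu:.4f}, E = {E:.4f}, T = {T:.4f}, synchronized at L = {sync_L}")
```

Because this chain is first-order Markovian, a single observation suffices to reach the asymptote, so only the L = 0 term contributes to the truncated sum.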

To ground this interpretation, we can establish a direct
relation between the transient information $T$ and the
amount of information required for synchronization to
block-Markovian environments. Assume that the agent
has a correct model $\mathcal{M} = \{\mathcal{V}, \mathcal{T}\}$ of the environment,
where $\mathcal{V}$ is a set of states and $\mathcal{T}$ is the rule governing
transitions between states. The task for the agent is to
make observations and determine the state $v \in \mathcal{V}$ of the
environment. Once the agent knows with certainty the
current state, it is synchronized to the environment, and
the average per-symbol uncertainty is exactly $h_\mu$. We are
interested in describing how difficult it is to synchronize
to a directly observed Markov process.

The agent's knowledge of $\mathcal{V}$ is given by a distribution
over the states $v \in \mathcal{V}$. Let $\Pr(v | s^L, \mathcal{M})$ denote the distribution
over $\mathcal{V}$, given that the particular sequence $s^L$ has
been observed and the agent has internal model $\mathcal{M}$. The
entropy of this distribution measures the agent's average
uncertainty in predicting $v \in \mathcal{V}$. Averaging this uncertainty
over the possible length-$L$ sequences, we obtain
the average agent-environment uncertainty:

\[ \mathcal{H}(L) \equiv - \sum_{s^L} \Pr(s^L) \sum_{v \in \mathcal{V}} \Pr(v | s^L, \mathcal{M}) \log_2 \Pr(v | s^L, \mathcal{M}) . \tag{10} \]

The quantity $\mathcal{H}(L)$ can be used as a criterion for synchronization.
The agent is synchronized to the environment
when $\mathcal{H}(L) = 0$, that is, when the agent is completely
certain about the state $v \in \mathcal{V}$ of the mechanism generating
the sequence. When the condition in Eq. (9) is met,
we see that $\mathcal{H}(L) = 0$, and the uncertainty associated
with the prediction of the model $\mathcal{M}$ is exactly $h_\mu$.

However, while the agent is still unsynchronized
$\mathcal{H}(L) > 0$. We refer to the total average uncertainty
experienced by an agent during the synchronization process
as the synchronization information $S$:

\[ S \equiv \sum_{L=0}^{\infty} \mathcal{H}(L) . \tag{11} \]

The synchronization information measures the average
total information that must be extracted from observations
so that the agent is synchronized.

In the following, we assume that the environment is
Markovian of order $R$. In contrast to the scenario depicted
in Fig. 1, we assume that the Markov model is
not hidden, in the sense that internal states are directly
observable.

Theorem 1 For an order-$R$ Markovian environment,
the synchronization information $S$ is given by:

\[ S = T + \frac{1}{2} R (R+1) h_\mu . \tag{12} \]

Thus, the transient information $T$, together with the
entropy rate $h_\mu$ and the order $R$ of the Markov process,
measures how difficult it is to synchronize to an environment.
If a system has a large $T$, then, on average, an
agent will be highly uncertain about the internal state
of the environment while synchronizing to it. Thus, $T$
measures a structural feature of the environment: how
difficult it is for an agent to synchronize to it.
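One way to see where the term $\frac{1}{2} R (R+1) h_\mu$ comes from is the following sketch. It rests on assumptions made here for illustration, not taken from the text: (i) the states of an order-$R$ Markov process are identified with its $R$-blocks, so that $\mathcal{H}(L) = H(R) - H(L)$ for $L < R$ and $\mathcal{H}(L) = 0$ for $L \geq R$; (ii) the sums defining $S$ and $T$ start at $L = 0$ with $H(0) = 0$. For an order-$R$ process, $H(L) = E + h_\mu L$ holds for all $L \geq R$, so the terms of $T$ vanish there.

```latex
\begin{align*}
S &= \sum_{L=0}^{R-1} \left[ H(R) - H(L) \right]
   = \sum_{L=0}^{R-1} \left[ E + h_\mu R - H(L) \right] \\
  &= \sum_{L=0}^{R-1} \left[ E + h_\mu L - H(L) \right]
     + h_\mu \sum_{L=0}^{R-1} (R - L) \\
  &= T + \tfrac{1}{2} R (R+1) h_\mu .
\end{align*}
```

The last line uses $\sum_{L=0}^{R-1} (R - L) = R + (R-1) + \cdots + 1 = \frac{1}{2} R (R+1)$.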

VI. APPLICATIONS AND IMPLICATIONS 

Using $h_\mu$, $E$, and $T$ one can distinguish various types
of entropy convergence and different structural classes of 
environment. We can now return to the set of questions 
posed in the introduction: How can we untangle differ- 
ent sources of apparent randomness? In particular, what 
happens to estimates of the environment's randomness if 
we ignore its structure? 

Here we show that there are direct and empirically important
consequences of ignoring structural properties.
Namely, missed regularities are converted to apparent 
randomness, assumed memory produces false predictabil- 
ity, and assumed synchronization leads to memory un- 
derestimates. These result in a range of misleading infer- 
ences about both the environment's randomness and its 
structure. We consider four different issues: 

1. What happens when an agent ignores entropy-rate 
convergence? 

2. What happens when the environment's apparent 
memory is ignored? 

3. What happens if the agent ignores synchroniza- 
tion? 

4. What happens if the agent assumes it is synchro- 
nized to the environment, when it is not? 



A. Disorder as the Price of Ignorance 

The first two questions are closely related and rather
straightforward to answer. The preceding sections defined
several different quantities ($h_\mu$, $E$, and $T$) that
measure randomness, memory, synchronization, and
other features of a process. For the most part, these
are asymptotic quantities in the sense that they involve
the behavior of the function $H(L)$ in the $L \to \infty$ limit.
Thus, their exact empirical estimation demands that an
infinite number of measurements (for accurate estimates
of sequence probabilities) of infinitely long sequences be
made. Obviously, other than by analytic means, it is
not possible to calculate exactly such quantities. Exact,
$L \to \infty$ results are known for only a few special processes
that are analytically tractable.

This leads one to ask, Even when sequence probabil- 
ities are accurately known, how well can these various 
environment properties be estimated at finite L? What 
errors are introduced, and are these errors related in any 
way? 

The simplest such question, the first one listed above,
arises when one attempts to estimate source randomness
$h_\mu$ via the approximation $h_\mu(L)$. Generally, stopping
the estimate at finite $L$ gives one a rate $h_\mu(L)$ which is
larger than the actual rate $h_\mu$. That is, the environment
appears more random if we ignore correlations between
observations separated by more than $L$ steps. For a discussion
of several methods to improve on the estimator
$h_\mu(L)$ in the context of dynamical systems, see Ref. [ ].



FIG. 3. Ignored memory is converted to randomness: Illustration
of how ignoring memory, in this case implicitly assuming
$E = 0$ as Eq. (2) implies, when actually $E > 0$, leads
to an overestimate $h'_\mu(L)$ (slope of dotted line) of the actual
entropy rate $h_\mu$ (slope of dashed line).

An agent could also estimate $h_\mu$ at finite $L$ by using
$h'_\mu(L) = H(L)/L$, as suggested by Eq. (2). Using
this definition to estimate $h_\mu$ is tantamount to assuming
that $E = 0$, as illustrated by the dashed line $h'_\mu L$
in Fig. 3. Now suppose an agent makes measurements
of an environment with entropy rate $h_\mu$ and excess entropy
$E > 0$. Then, at a given $L$, we can ask what the
entropy rate estimate $h'_\mu(L) = H(L)/L$ is. As shown
in [ ], $h'_\mu(L) \geq h_\mu(L) \geq h_\mu$. But how much more random
does the environment appear? This is answered in
a straightforward way by the following proposition.

Proposition 3 When the agent is synchronized to the
environment,

\[ h'_\mu(L) - h_\mu = \frac{E}{L} . \tag{13} \]

In this way, $E$ bits of memory are converted into additional,
apparent randomness. The environment appears
more random due to the agent's ignoring one of its structural
properties.

Although $E$ is an $L$-asymptotic quantity, the error $E/L$
in the entropy-rate estimate dominates at small $L$. Moreover,
being restricted to small $L$ is typical of experimental
situations with limited data or in which drift is present.
One cannot reliably estimate the $L$-block probabilities
$\Pr(s^L)$ at large $L$ due to the exponential growth in their
number or the nonstationarity of block probabilities, respectively.
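The size of this effect can be checked numerically on a small example. The sketch below assumes, for illustration, the golden mean Markov chain (a 1 is always followed by a 0) with H(0) = 0; the variable names are hypothetical. It compares the difference estimate H(L) - H(L-1) with the ratio estimate H(L)/L and tabulates the excess of the ratio estimate over the true entropy rate next to E/L.

```python
from itertools import product
from math import log2

# Assumed example: the golden mean Markov chain (a 1 is always followed by a 0).
P = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}
PI = {0: 2 / 3, 1: 1 / 3}  # stationary distribution of P


def block_entropy(L):
    """Exact H(L) in bits, enumerating all 2**L words; H(0) = 0 by convention."""
    total = 0.0
    for word in product((0, 1), repeat=L):
        if not word:
            continue
        p = PI[word[0]]
        for a, b in zip(word, word[1:]):
            p *= P[a][b]
        if p > 0:
            total -= p * log2(p)
    return total


h_mu = 2 / 3                 # entropy rate of this chain, bits/symbol
E = block_entropy(1) - h_mu  # its excess entropy (the chain is first-order Markov)

rows = []
for L in range(1, 9):
    H_L = block_entropy(L)
    h_diff = H_L - block_entropy(L - 1)  # difference estimate h_mu(L)
    h_ratio = H_L / L                    # ratio estimate h'_mu(L) = H(L)/L
    rows.append((L, h_diff, h_ratio, h_ratio - h_mu, E / L))

for L, h_diff, h_ratio, excess, e_over_l in rows:
    print(f"L={L}  h_mu(L)={h_diff:.4f}  h'_mu(L)={h_ratio:.4f}  "
          f"h'_mu(L)-h_mu={excess:.4f}  E/L={e_over_l:.4f}")
```

In this example the difference estimate reaches the true rate at L = 2, whereas the ratio estimate approaches it only as 1/L.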



B. Predictability and Instantaneous Synchronization 

Conversely, if one assumes a fixed amount of memory 
E, we shall see that this leads to an underestimate of 
the entropy rate and the environment appears more 
predictable than it is. Assuming a fixed excess entropy 
is not something that one is likely to do in the particu- 
lar setting here, in which an agent empirically measures 
entropy density and related quantities from observation 
sequences. In a more general modeling setting, however, 
one always runs the risk of using too large a model and, 
in so doing, "projecting" some particular structure — such 
as, additional memory capacity — onto the environment. 
Assuming a fixed, nonzero value for the excess entropy 
is, in an abstract sense, an example of over-fitting. Given 
this, we ask, What is the consequence of assuming a fixed 
value for E? 

Equivalently, what happens if the agent assumes that
it is synchronized to the environment at some finite $L$,
implying that $H(L) = E + h_\mu L$ at that $L$? The geometric
construction for this scenario is given in Fig. 4. In
effect the environment is erroneously considered to be a
completely observable Markovian process in which $H(L)$
converges to its asymptotic form exactly at some finite
$L$ [ ]. If the agent then uses its assumed value for $E$,
one arrives at the estimator $\hat{h}_\mu(L)$, where

\[ \hat{h}_\mu(L) \equiv \frac{H(L) - E}{L} . \tag{14} \]

At a given $L$ the effect is that the agent considers the environment
to have a larger $E$ than it actually has at that
$L$. The line $E + h_\mu L$ appears fixed at $E$ when that intercept
should be lower at the given $L$. The result, easily
gleaned from Fig. 4, is that the entropy rate is underestimated
as $\hat{h}_\mu$. (The two entropy rates are the slopes
of the two straight lines.) In other words, the agent will
believe the environment to be more predictable than it
actually is.

Proposition 4 An agent monitors an environment
with excess entropy $E > 0$. If the agent assumes it is
synchronized when it is not, then

\[ \hat{h}_\mu(L) \leq h_\mu . \tag{15} \]

FIG. 4. Assumed synchronization converted to false predictability:
Schematic illustration of how assuming one is synchronized
leads to an underestimate $\hat{h}_\mu$ (slope of dotted line)
for an environment with excess entropy $E > 0$ and entropy
rate $h_\mu$ (slope of dashed line).
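The underestimate can be illustrated with an environment that is not synchronized after a single observation. The sketch below assumes a period-3 environment ...001001..., for which the entropy rate is zero and the excess entropy is log2(3) bits, and takes the estimator to be the block entropy minus the assumed memory, divided by L. The names `PERIOD` and `h_hat` are illustrative.

```python
from collections import Counter
from math import log2

PERIOD = (0, 0, 1)  # assumed example: the period-3 environment ...001001...


def block_entropy(L):
    """Exact H(L) in bits; each phase of the periodic pattern is equally likely."""
    if L == 0:
        return 0.0
    p = len(PERIOD)
    words = Counter(
        tuple(PERIOD[(phase + i) % p] for i in range(L)) for phase in range(p)
    )
    return -sum((c / p) * log2(c / p) for c in words.values())


h_mu = 0.0             # a periodic sequence is perfectly predictable
E = log2(len(PERIOD))  # excess entropy of this period-3 process

h_hat = {}
for L in range(1, 5):
    # Estimator obtained by assuming synchronization with the asymptotic E.
    h_hat[L] = (block_entropy(L) - E) / L
    print(f"L={L}  H(L)={block_entropy(L):.4f}  h_hat(L)={h_hat[L]:.4f}  h_mu={h_mu:.4f}")
```

At L = 1 the agent is not yet synchronized and the estimate falls below the true rate, here even below zero; from L = 2 onward the block entropy has reached its asymptote and the estimate agrees with the true rate.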



C. Assumed Synchronization Implies Reduced 
Apparent Memory 



In addition to analyzing the effects on the apparent entropy
rate due to assuming synchronization, we can ask
a complementary question: What are the effects of assuming
synchronization on estimates $\hat{E}$ of the apparent
memory? Figure 5 illustrates this situation.



FIG. 5. Assumed synchronization leads to less apparent
memory: Schematic illustration of how assuming synchronization
to an environment, in this case implicitly assuming
$H(L) = \hat{E} + h_\mu L$, leads to an underestimate $\hat{E}$ of the actual
memory $E > 0$.

If, at a given $L$, we approximate the entropy rate estimate
$h_\mu(L) = H(L) - H(L-1)$ by the true entropy rate
$h_\mu$, then the offset between the asymptote and $H(L)$ is
simply $E + h_\mu L - H(L)$. Thus, looking at Fig. 5, we see
that we have a reduced apparent memory $\hat{E} < E$ of

\[ \hat{E} = H(L) - h_\mu L . \tag{16} \]

In fact, since the estimated entropy rate is larger than $h_\mu$,
the reduction in apparent memory is even larger. Thus,
assuming synchronization, in the sense that $h_\mu(L) = h_\mu$,
leads one to underestimate the apparent memory, as measured
by the excess entropy $E$. And so, the environment
appears less structurally complex than it is.
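Both statements, the reduced apparent memory and the further reduction when the finite-L entropy-rate estimate is used, can be seen in a small example. The sketch below assumes a period-4 environment ...00010001... with zero entropy rate and an excess entropy of 2 bits; the column names are illustrative. The first column uses the true entropy rate in Eq. (16), the second substitutes the estimate H(L) - H(L-1).

```python
from collections import Counter
from math import log2

PERIOD = (0, 0, 0, 1)  # assumed example: the period-4 environment ...00010001...


def block_entropy(L):
    """Exact H(L) in bits; each phase of the periodic pattern is equally likely."""
    if L == 0:
        return 0.0
    p = len(PERIOD)
    words = Counter(
        tuple(PERIOD[(phase + i) % p] for i in range(L)) for phase in range(p)
    )
    return -sum((c / p) * log2(c / p) for c in words.values())


h_mu = 0.0             # true entropy rate of a periodic process
E = log2(len(PERIOD))  # true excess entropy, 2 bits

rows = []
for L in range(1, 6):
    H_L = block_entropy(L)
    h_L = H_L - block_entropy(L - 1)  # finite-L entropy-rate estimate
    E_hat_true_rate = H_L - h_mu * L  # apparent memory using the true rate
    E_hat_est_rate = H_L - h_L * L    # apparent memory using the estimated rate
    rows.append((L, E_hat_true_rate, E_hat_est_rate))

for L, e_true, e_est in rows:
    print(f"L={L}  E_hat(true rate)={e_true:.4f}  E_hat(estimated rate)={e_est:.4f}  E={E:.4f}")
```

In this example the apparent memory stays below 2 bits until the block entropy reaches its asymptote, and the version that uses the estimated rate lags further behind.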

VII. CONCLUSION 

We have reviewed several information theoretic measures
of an environment's randomness and several of its
structural properties. We also introduced a new quantity,
the transient information $T$. One of the central results
of this work is contained in Theorem 1, which states that
$T$ is directly related to the total agent-environment uncertainty
experienced while an agent synchronizes to a
Markovian environment.

We then considered various trade-offs between finite-$L$
estimates of the excess entropy $E$ and the entropy rate
$h_\mu$. In particular, we showed that if an agent does not
take one or another into account it will systematically
over- or underestimate an environment's entropy rate $h_\mu$.
For example, there can be an inadvertent conversion of
ignored memory into apparent randomness. The magnitude
of this effect is proportional to the difference
between environment memory and the upper bound on
memory that the agent can store in its internal model. In
a complementary way, one can inadvertently convert assumed
memory into false predictability. As a result, an
agent must have some method for accounting for an environment's
structural features, even if its focus is only on
the apparently simple question of how random a process